Introduction

The goal of this session is to give you a taste of the different features of the DataScience Platform, including conducting analyses, publishing reports, scheduling scripts, and deploying models.

Imagine that you are a data scientist at a company that has to perform dynamic inventory management. One example is a ride-sharing company that needs to know which parts of a city to direct its drivers to, depending on the time of day and other factors such as the weather.

Here we’ll perform some analysis in Jupyter and then publish these findings as a Report that a business user will find easy to consume.

Load data

The data is in data/processed_uber_nyc.RData and contains two dataframes:

  1. agg_data

  2. zone_polys
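A minimal sketch of loading the file in R, assuming the working directory is the project root so that the relative path resolves. Loading into a separate environment (here called `uber_env`, a name chosen for illustration) is a choice made in this sketch, not something the original analysis prescribes.

```r
# Path taken from the text; assumes the working directory is the project root.
data_path <- "data/processed_uber_nyc.RData"

# Load into a dedicated environment so the restored objects are easy to
# list and cannot silently overwrite anything already in the workspace.
uber_env <- new.env()

if (file.exists(data_path)) {
  loaded <- load(data_path, envir = uber_env)  # returns the restored object names
  print(loaded)                                # should include agg_data and zone_polys
  agg_data   <- uber_env$agg_data
  zone_polys <- uber_env$zone_polys
} else {
  message("File not found: ", data_path)
}
```

Calling `load(data_path)` directly also works; it simply restores both dataframes into the global environment.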

About the data

The source of the data for this exercise is the Uber Pickups in New York City dataset by FiveThirtyEight. Data for 20 million pickups are aggregated by hour, date, and taxi zone (i.e., an approximate neighborhood) and enriched with calendar and weather data. Each dataframe is described in more detail below.

agg_data

This dataframe contains information about the number of pickups.

Fields:

  • locationID: unique ID for each taxi zone

  • date

  • hour: 24H format

  • borough: Borough that the zone is located in (e.g. Manhattan, Brooklyn, Queens)

  • zone: Name of the taxi zone (e.g. Times Sq, Chinatown, Central Harlem)

  • pickups: Number of pickups

  • day: Day of week (e.g. Mon, Tue, Wed)

  • is_holiday: Whether that day was a holiday (Boolean)

  • mean_temp_F: Mean temperature that day in Fahrenheit

## Source: local data frame [3 x 9]
## Groups: locationID, date, hour, borough [3]
## 
##   locationID       date  hour borough           zone pickups   day
##        <int>      <chr> <chr>  <fctr>         <fctr>   <int> <chr>
## 1          1 2014-04-01    03     EWR Newark Airport       2   Tue
## 2          1 2014-04-01    04     EWR Newark Airport       4   Tue
## 3          1 2014-04-01    05     EWR Newark Airport       4   Tue
## # ... with 2 more variables: is_holiday <lgl>, mean_temp_F <int>

zone_polys

This is a dimension table that describes the boundaries of each taxi zone.

Fields:

  • long: Longitude

  • lat: Latitude

  • order: Rank of point when drawing boundary

  • hole: Whether to plot a hole in that location (Boolean)

  • piece: The piece of the zone that the point is associated with

  • id: ID of zone. Same as locationID in agg_data

  • group: Group that the point belongs to

##        long      lat order  hole piece id group
## 1 -74.18445 40.69500     1 FALSE     1  1   0.1
## 2 -74.18449 40.69509     2 FALSE     1  1   0.1
## 3 -74.18450 40.69518     3 FALSE     1  1   0.1

Exploratory analysis

What areas experience the highest demand?

Insights:

  • Lower Manhattan experiences the highest demand

  • Demand is also high at the airports (JFK and LaGuardia)

  • Not much activity in the outer boroughs
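One way to arrive at a ranking like this is to total the pickups per zone and sort. The sketch below uses base R on a small toy table whose counts are invented for illustration; with the real data, `agg_data` would take the place of `toy`.

```r
# Toy stand-in for agg_data; the pickup counts are invented for illustration.
toy <- data.frame(
  zone    = c("Times Sq", "Times Sq", "Chinatown", "Newark Airport", "Chinatown"),
  borough = c("Manhattan", "Manhattan", "Manhattan", "EWR", "Manhattan"),
  pickups = c(120L, 80L, 40L, 4L, 10L)
)

# Total pickups per zone, summed over every hour and date.
zone_totals <- aggregate(pickups ~ zone + borough, data = toy, FUN = sum)

# Sort so the busiest zones come first.
zone_totals <- zone_totals[order(-zone_totals$pickups), ]
print(zone_totals)
```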

Feature Engineering

It looks like many of the neighborhoods show similar pickup patterns. By clustering the neighborhoods, we will likely improve both the predictive and the computational performance of the model.

To cluster the neighborhoods, we will perform k-means clustering on the hourly pickup patterns. We will use the elbow method to pick the most suitable number of clusters.
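The elbow method can be sketched with base R's `kmeans()` as follows. The input here is simulated (30 hypothetical zones with three made-up daily profiles), and normalising each zone's hourly counts to shares is an assumption of this sketch; the original analysis may have prepared the patterns differently.

```r
set.seed(42)  # reproducible toy data and k-means starts

# Simulated stand-in for the real data: 30 hypothetical zones x 24 hours.
# With the real data, a zone-by-hour matrix could be built with something like
#   tapply(agg_data$pickups, list(agg_data$zone, agg_data$hour), sum)
# (missing zone/hour combinations would then need to be set to 0).
hours <- 0:23
profiles <- rbind(
  dnorm(hours, mean = 8,  sd = 2),   # morning-peaked profile
  dnorm(hours, mean = 18, sd = 2),   # evening-peaked profile
  rep(1 / 24, 24)                    # flat profile
)
pattern <- profiles[rep(1:3, each = 10), ] +
  matrix(abs(rnorm(30 * 24, sd = 0.01)), nrow = 30)

# Assumption: normalise each row to shares of the daily total, so that
# clustering reflects the shape of the pattern rather than its volume.
pattern <- pattern / rowSums(pattern)

# Total within-cluster sum of squares for each candidate k.
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(pattern, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which wss stops dropping sharply.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Using `nstart = 20` reruns k-means from several random starts and keeps the best fit, which makes the curve less sensitive to an unlucky initialisation.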

Four appears to be the most appropriate number of neighborhood clusters. Let’s visualize the clusters.
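To draw the clusters on a map, each zone's cluster label has to be attached to its boundary points, using the fact that `id` in zone_polys matches `locationID` in agg_data. The following is a rough sketch in base R with invented coordinates and cluster labels; `clusters`, `polys`, and `mapped` are hypothetical names, and the original figure may well have been produced with a different plotting library.

```r
# Hypothetical cluster assignments, one row per zone (values invented).
clusters <- data.frame(locationID = c(1L, 2L), cluster = c(3L, 1L))

# Toy boundary points in the shape of zone_polys (coordinates invented).
polys <- data.frame(
  long  = c(-74.0, -73.9, -73.9, -74.0, -73.8, -73.7, -73.7),
  lat   = c(40.6, 40.6, 40.7, 40.7, 40.6, 40.6, 40.7),
  order = c(1:4, 1:3),
  id    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  group = c("1.1", "1.1", "1.1", "1.1", "2.1", "2.1", "2.1")
)

# zone_polys$id corresponds to agg_data$locationID, so join on that pair.
mapped <- merge(polys, clusters, by.x = "id", by.y = "locationID")

# merge() does not guarantee row order; restore the drawing order per group.
mapped <- mapped[order(mapped$group, mapped$order), ]

# Draw each polygon, filled with a colour chosen by its cluster label.
plot(mapped$long, mapped$lat, type = "n",
     xlab = "Longitude", ylab = "Latitude")
for (g in split(mapped, mapped$group)) {
  polygon(g$long, g$lat, col = g$cluster[1] + 1)
}
```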